
Microbial Genomics

Microbiology Society

Preprints posted in the last 30 days, ranked by how well they match Microbial Genomics's content profile, based on 204 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.

1
Integrated Machine Learning-PanGWAS Reveals Chromosome-Encoded Persistence Networks and Plasmid Plasticity in Recurrent Urinary Tract Infection in Escherichia coli

Rajendran, S.; Nagarajan, S.; MOHAN S., S.

2026-05-22 infectious diseases 10.64898/2026.05.20.26353739 medRxiv
Top 0.1%
41.6%

Background: Recurrent urinary tract infections (rUTI) represent a major clinical challenge due to persistent clinical symptoms, repeated antibiotic exposure, and increased risk of multidrug resistance. Furthermore, clinical management of rUTI remains challenging, as existing diagnostic and treatment guidelines are largely designed for uncomplicated, acute infections. Although uropathogenic Escherichia coli (UPEC) is the predominant cause of community-acquired UTIs, the pathogen-derived genomic features that may predispose certain E. coli strains to repeatedly establish infection are not fully understood. Methods: To dissect the genetic signals across genomic compartments that distinguish rUTI-associated isolates from those causing sporadic infection, pan-genome analysis was conducted in three frameworks: (i) combined genomes (chromosome + plasmid), (ii) bacterial chromosomes only, and (iii) plasmids only. Population structure was evaluated using Gubbins, recombination-aware phylogeny (IQ-TREE), phylogroup distribution, pan-genome openness using Heaps' law, and plasmidome architecture using MOB-suite. Findings: Supervised machine learning models showed that the highest discriminatory performance was achieved using the combined genomic dataset (accuracy ~0.98), and integration of feature-selected genes with PanGWAS (Pyseer and Scoary) identified a robust set of recurrence-associated genes, namely cbtA, cbeA, and ldrD, which were consistently detected across machine learning and association frameworks. Subsequent association rule mining further revealed cooperative gene networks enriched in rUTI isolates, particularly involving toxin-antitoxin modules and metabolic regulators. Interpretation: Overall, this integrated ML-PanGWAS approach demonstrates that rUTI is a lineage-independent, polygenic phenotype encoded within a combined chromosomal-plasmid genomic context, providing new insights into the bacterial genomic architecture underlying recurrent disease and offering candidate biomarkers for future diagnostic and therapeutic development.

2
Genomic Characterization of the RyC collection: 50 Multidrug Resistant Clinical Isolates of Escherichia coli and Klebsiella spp.

Rodera-Fernandez, P.; Sastre-Dominguez, J.; Costas, C.; Alonso-del-Valle, A.; de la Fuente, J.; Hernandez-Garcia, M.; Canton, R.; Santos-Lopez, A.; San Millan, A.

2026-05-18 microbiology 10.64898/2026.05.18.725816 medRxiv
Top 0.1%
37.6%

Antimicrobial resistance (AMR) is a major global public health threat, and Enterobacterales producing extended-spectrum β-lactamases (ESBLs) represent some of the most common and concerning pathogens in clinical settings. Importantly, the dissemination of these resistance mechanisms is largely driven by mobile genetic elements (MGEs), particularly plasmids. Advancing our understanding of AMR evolution through experimentation requires moving beyond domesticated laboratory strains and towards clinically relevant isolates. However, despite the abundance of genomic data in public repositories, there is a lack of well-characterised clinical collections available for experimental work. Here, we characterise the RyC collection, which includes 50 multidrug-resistant, ESBL-producing Escherichia coli and Klebsiella spp. strains isolated from the gut microbiota of hospitalised patients at Hospital Universitario Ramon y Cajal (Madrid, Spain). We generated high-quality genome assemblies for all strains using a combination of short- and long-read sequencing technologies. From these data, we performed a comprehensive characterisation of the pangenome, mobilome, resistome and defensome of the collection. We present the RyC collection as a robust and experimentally tractable resource to study AMR evolution and MGE dynamics in clinically relevant bacterial backgrounds. Impact statement: Antimicrobial resistance (AMR) is a growing global health threat driven by the rapid dissemination of resistance genes among clinically relevant bacteria. A major challenge in studying AMR evolution is the reliance on domesticated laboratory strains, which poorly represent the complexity of pathogens circulating in hospitals. Here, we introduce the RyC collection, a set of well-characterised, multidrug-resistant Enterobacterales isolates obtained from hospitalised patients. By combining high-quality genome sequencing with detailed analyses of their gene content and mobile genetic elements (MGEs), this collection provides a realistic and experimentally tractable system to study how resistance evolves and spreads. The RyC collection will facilitate research on AMR dynamics, plasmid biology and host-MGE interactions, ultimately contributing to the development of more effective strategies to combat antibiotic-resistant infections.

3
Genomic epidemiology and transmission dynamics of plasmids carrying New Delhi metallo-β-lactamase (blaNDM) at a single hospital system over five years

Raabe, N. J.; Mills, E. J.; Bapat, S.; Griffith, M. P.; Shutt, K.; Waggle, K. D.; Sundermann, A. J.; Shields, R. K.; Pless, L.; Snyder, G. M.; Harrison, L. H.; Van Tyne, D.

2026-05-18 infectious diseases 10.64898/2026.05.14.26353212 medRxiv
Top 0.1%
27.4%

Background: Conjugative plasmids encoding New Delhi metallo-beta-lactamase (blaNDM) pose a threat for the spread of carbapenem resistance among healthcare-acquired pathogens. Plasmid-associated outbreaks of blaNDM-producing bacteria can involve multiple bacterial species and persist over long time periods, making their detection and control difficult. We systematically studied the genomic epidemiology of blaNDM-encoding plasmids detected within a single hospital system over a five-year period. Methods: blaNDM-producing isolates were collected from clinical cultures as part of the Enhanced Detection System for Healthcare-Associated Transmission (EDS-HAT) genomic sequencing active surveillance program, or during infection prevention and control (IP&C) investigations. Isolates were identified as blaNDM producers by polymerase chain reaction (PCR); the presence of plasmid-encoded blaNDM genes was confirmed by sequencing on both Illumina and Oxford Nanopore platforms. Plasmids were clustered using Pling, and bacterial relatedness of host isolates was evaluated with split k-mer analysis. Electronic health record data were used to identify shared unit-level spatiotemporal exposures and epidemiologic links within both plasmid and host clusters. Results: We identified 61 blaNDM-producing isolates collected from 54 patients sampled between November 2020 and July 2025. Isolates belonged to 15 Enterobacterales species; Enterobacter hormaechei was the most frequently sampled species (n=23, 37%), and blaNDM-5 was the most frequently observed blaNDM allele (n=36, 59%). We observed six clusters of genetically similar blaNDM-encoding plasmids, each containing 2-28 isolates, and eight singleton plasmids. The two largest plasmid clusters consisted of a highly conserved 46 kb IncX3 family blaNDM-5-encoding plasmid (n=28 plasmids, 9 species) and a more variable 98-201 kb IncC family blaNDM-1-encoding plasmid (n=12 plasmids, 6 species). Epidemiologic investigation paired with whole genome sequencing identified spatiotemporal associations between shared patient exposures and putative plasmid and bacterial transmission clusters, suggesting that unit-level exposures contribute to plasmid dissemination. Finally, analysis of publicly available sequences showed that the most prevalent plasmids detected, IncX3(blaNDM-5) and IncC(blaNDM-1), also demonstrated high global prevalence. Conclusions: This study demonstrates the diversity of blaNDM-carrying plasmids within a single hospital system and their capacity to cause prolonged, multispecies outbreaks. Integrating whole genome sequencing with epidemiologic data identified unit-level spatiotemporal overlap as a likely contributor to plasmid dissemination in the hospital.

4
PCR-free, targeted genomic sequencing using Dynamically optimized reference Adaptive Sampling (DORAS)

Borcard, L.; Gempeler, S.; Terrazos Miani, M. A.; Casanova, C.; Ramette, A.

2026-05-29 genomics 10.64898/2026.05.26.727915 medRxiv
Top 0.1%
22.5%

Whole genome sequencing (WGS) has become a cornerstone of clinical microbiology, enabling comprehensive analysis of microbial genome diversity. However, WGS is often computationally intensive and time-consuming when applied to specific applications like multilocus sequence typing (MLST), where only a subset of genes is needed for typing. This study evaluates the potential of adaptive sampling (AS), a software-based solution available on Oxford Nanopore Technologies (ONT) devices, to optimize sequencing runs for MLST by reducing the production of unnecessary reads falling outside of the target areas. We demonstrate that AS, when used directly with the target gene sequences, does not reach sufficient target coverage when compared to WGS baseline sequencing due to inefficient read recruitment. Thus, we developed a novel, PCR-free approach, termed Dynamically Optimized Reference Adaptive Sampling (DORAS), which streamlines gene-specific enrichment by targeting genomic regions of interest and their genomic vicinity. DORAS first determines the genomic context of regions of interest for each sample, and then dynamically adjusts the length of the reference sequences during live sequencing. Consensus sequences are periodically constructed and evaluated for taxonomic classification. We demonstrate that full MLST profiles can be obtained in approximately half the time required for whole-genome sequencing to achieve 30X coverage (3 vs. 6 h), with no additional hands-on library preparation time. Validation on clinical isolates from hospital outbreaks of Corynebacterium diphtheriae and vancomycin-resistant enterococci, and on routine clinical E. coli isolates, demonstrated consistent retrieval of MLST types compared with standard WGS methods. DORAS thus offers a cost-effective, efficient solution for routine surveillance and outbreak investigations based on MLST types in the clinical setting.

5
The impact of long-read sequencing on fungal genome assemblies: progress and disparity

Kroll, E.; Zoclanclounon, Y. A. B.; Urban, M.; Hill, R.; Hammond-Kosack, K. E.

2026-05-14 genomics 10.64898/2026.05.12.724544 medRxiv
Top 0.2%
12.3%

Fungal genomics has expanded rapidly over the past 30 years, and recently its pace and breadth have increased further for many taxa, although many taxonomic gaps persist. With three decades of rapid growth, fungal genomics now merits a re-examination of its history, progress, and unresolved taxonomic gaps. Here, we review the development of fungal genomics from early efforts such as the Fungal Genome Initiative to current progress driven by third-generation long-read sequencing. We have compiled and summarised publicly available fungal genomes to highlight trends in assembly quality, adoption of long-read technologies, and taxonomic representation. Notably, substantial phylogenetic gaps remain, particularly outside Dikarya, and significant challenges persist for unculturable taxa. This review identifies priorities for the fungal community, including: (1) coordinated efforts to close major taxonomic gaps across the fungal tree of life; (2) improved repository metrics to facilitate identification of high-quality assemblies; and (3) improved and standardised genome annotation, which is lacking for most assemblies. Together, these steps will support the development of reliable genomic resources that capture the full breadth of diversity across the fungal kingdom, generating foundational data for comparative genomics, evolutionary biology, functional studies, genetic studies and applied research.

6
The Burden and Genomic Characterization of Shigella-Associated Diarrhea in Children Under Five in Lusaka, Zambia: A Prospective Cohort Study

Chibuye, M. M.; Harris, V. C.; Brizuela, J.; Bosomprah, S.; Simuyandi, M.; Mwape, K.; Silwamba, S.; Liswaniso, F.; Chibesa, K.; Miti, S.; Piedade, G.; Luchen, C. C.; Chisenga, C. C.; Mende, D. R.; Schultsz, C.; Chilengi, R.

2026-05-21 epidemiology 10.64898/2026.05.14.26353268 medRxiv
Top 0.2%
12.2%

Background: Shigella is a leading cause of childhood diarrhea in low- and middle-income countries and is increasingly resistant to first-line antibiotics. We conducted a surveillance study to determine the incidence, genomic characteristics, and AMR profiles of Shigella infections in children under five with moderate to severe diarrhea (MSD) in Lusaka, Zambia. Methods: Between 15 September 2020 and 30 November 2021, a prospective cohort of 1,400 children under five was enrolled during a community census in a peri-urban setting and passively followed for 9.5 months for MSD. During enrollment, socio-demographic data were collected using electronic questionnaires, while clinical data were collected through the DHIS platform. The main outcome, Shigella in diarrheal stool in children under five, was detected using culture and Loop-mediated Isothermal Amplification (LAMP) targeting the ipaH gene. Cox proportional hazards models were used to assess the incidence and risk factors of Shigella (ipaH) infections. Whole-genome sequencing (WGS) was used to characterize the genomic diversity and antimicrobial resistance genes, complemented by phenotypic antibiotic susceptibility testing. Results: There were 230 first episodes of Shigella over a follow-up time of 9,581.7 child-months, yielding an incidence of 24.0 (95% CI 21.1-27.3) cases per 1,000 child-months, with the highest incidence among 2 to 3-year-olds. The key risk factors identified were the water source (p=0.025) and age group (p=0.014). Genotypic characterization revealed 10 S. flexneri, 9 S. sonnei, and 3 S. boydii. The S. sonnei isolates formed two clusters, differing in virulence factors and plasmid profiles, indicating two possible circulating strains. Shigella isolates exhibited phenotypic and genotypic multidrug resistance, including against trimethoprim, aminoglycosides, and beta-lactams. Plasmid-mediated quinolone resistance (qnrS1) was identified in four S. flexneri isolates, with these genes located on the IncFIB(K) plasmid, highlighting the potential for horizontal transmission and spread of quinolone resistance in this region. No phenotypic or genotypic resistance to macrolides, the first-line treatment for Shigella in Zambia, was observed. Interpretation: We report a high burden of Shigella with multidrug resistance, including resistance to fluoroquinolones. These findings highlight the increasing resistance of Shigella to first-line antibiotics and underscore the importance of developing safe and effective vaccines, improving WASH conditions, and ongoing AMR surveillance. Funding: The EDCTP2 program, supported by the European Union, the Faculty for the Future Foundation (FFTF), the Netherlands Organization for Health Research and Development (ZonMw), and Health-Holland AMR-Global, Gloria, and Track-AMR.
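The incidence figure in the abstract above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the abstract does not say how the confidence interval was derived, so the log-scale Wald interval for a Poisson count used here is an assumption, and the function name `incidence_per_1000` is hypothetical.

```python
import math

def incidence_per_1000(events: int, person_time: float, z: float = 1.96):
    """Return (rate, lower, upper) per 1,000 units of person-time.

    Assumption: events follow a Poisson distribution, so the standard
    error of log(rate) is approximately 1 / sqrt(events).
    """
    rate = events / person_time * 1000
    factor = math.exp(z / math.sqrt(events))
    return rate, rate / factor, rate * factor

# 230 first episodes over 9,581.7 child-months, as stated in the abstract
rate, lower, upper = incidence_per_1000(230, 9581.7)
print(f"{rate:.1f} ({lower:.1f}-{upper:.1f}) per 1,000 child-months")
# prints: 24.0 (21.1-27.3) per 1,000 child-months
```

Under this assumption the output agrees with the reported 24.0 (95% CI 21.1-27.3); an exact Poisson interval would give slightly different bounds.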

7
A comparison of scalable approaches for the pairwise analysis of large pathogen genomic and spatial datasets: an application to studying Mycobacterium tuberculosis transmission

Lan, Y.; Wu, C.-Y.; Lin, H.-H.; Cohen, T.; Warren, J. L.

2026-05-21 microbiology 10.64898/2026.05.21.726848 medRxiv
Top 0.2%
10.3%

Pairwise analysis of genomic and spatial data offers opportunities to identify and estimate the associations between covariates and the transmission of pathogens between individuals. However, such pairwise analyses are computationally intensive, and may not be feasible to conduct given the high dyad count in even moderately sized datasets. Here we compare two approaches to increase the efficiency of pairwise analysis for large datasets. We quantify and compare the performance of divide-and-conquer Bayesian model fitting and pairwise case-control approaches for estimating associations between individual- and pair-level covariates and shared membership in a transmission cluster. We utilize a large dataset (n=4,154) of spatially-referenced, genomically-sequenced Mycobacterium tuberculosis isolates collected from a single city for this analysis. We find that the case-control approach produces unbiased estimates of effect sizes with expected credible interval coverage and is more robust than the divide-and-conquer method when effect sizes are large. Thus, we recommend using the case-control approach with at least three controls per case to downscale datasets for pairwise analysis when analysis of the entire dataset is not possible. This approach mitigates the computational challenges of pairwise Bayesian modeling on datasets that require significant computational resources while maintaining desired inferential properties. Author Summary: Pairwise analyses of large datasets to study pathogen transmission are computationally demanding because they typically require simultaneous analysis of each possible pair of individuals in a dataset; as datasets become larger these analyses often are not feasible to conduct even with access to high-performance computing resources. In this work, we compare a case-control approach and divide-and-conquer approaches for more efficient pairwise analysis of large datasets.
Using a large dataset of Mycobacterium tuberculosis isolates including genetic and spatial data, we investigate the performance of each method for estimating the associations between host covariates and genetic clustering of isolates. We find that the case-control approach is generally preferred over methods which first divide the data into subsets and then combine results. While additional extensions of these analyses are needed to test the generality of these findings to other data settings, this work provides a practical way forward for the pairwise analysis of large datasets to study pathogen transmission.
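To make the scale concrete, 4,154 isolates form 4,154 × 4,153 / 2 = 8,625,781 unordered pairs. The sketch below shows one simple way to downscale such a set of pairs with three controls per case. It is not the authors' procedure: the abstract does not describe how controls are selected, so uniform random sampling of unlinked pairs is an assumption, and `sample_case_control_pairs` and the toy cluster labels are hypothetical.

```python
import itertools
import random

def sample_case_control_pairs(cluster_of, controls_per_case=3, seed=0):
    """Split all isolate pairs into cases and a random subset of controls.

    cluster_of maps isolate ID -> cluster label (None for unclustered).
    A "case" is a pair sharing a cluster; a "control" is any other pair.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    for a, b in itertools.combinations(sorted(cluster_of), 2):
        same = cluster_of[a] is not None and cluster_of[a] == cluster_of[b]
        (cases if same else controls).append((a, b))
    # Keep at most controls_per_case controls for every case pair
    n_controls = min(len(controls), controls_per_case * len(cases))
    return cases, rng.sample(controls, n_controls)

toy = {"i1": "A", "i2": "A", "i3": "A", "i4": "B", "i5": "B",
       "i6": None, "i7": None, "i8": None}
cases, controls = sample_case_control_pairs(toy)
print(len(cases), len(controls))  # prints: 4 12
```

This toy version enumerates every pair before sampling, which is fine for illustration but would defeat the purpose on a large dataset; a practical implementation would draw control pairs without materialising the full list.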

8
Clinical Campylobacter jejuni isolates: genomes and genetic tools

Nasrollahi, V.; Foo, G. W.; Jaafar, T.; Elzagallaai, A. A.; Rieder, M. J.; Karas, B. J.

2026-05-21 synthetic biology 10.64898/2026.05.21.726778 medRxiv
Top 0.2%
10.1%

Campylobacter jejuni is a major cause of food-borne gastroenteritis and is responsible for substantial mortality and economic losses in meat and dairy production. Detecting C. jejuni in contaminated food samples remains difficult because current assays are culture-based, slow, and can yield false positives. As a result, contamination may not be identified for several days, limiting detection at the point of production. Developing improved assays has also been challenging because Campylobacter genetics and the biology of clinical isolates remain poorly understood. Here, we expand the C. jejuni genetic toolbox by sequencing two strains, HC1 and RM1164, derived from patient and food samples. We identified two cryptic plasmids in HC1, one potentially capable of conjugation and another conferring tetracycline resistance. We also engineered a mobilizable plasmid carrying an OriT sequence that can be transferred from Escherichia coli donor strains to C. jejuni RM1164 by conjugation. Together, these clinical isolates and the plasmid system expand the genetic tools available for C. jejuni.

9
Integrating patient movement and pathogen genomics to support hospital infection prevention with PathoPath: a method development study

Sajib, M. S.; Tanmoy, A. M.; Kanon, N.; Jui, A. B.; Islam, M. S.; Dola, N. Z.; Hossain, M. M.; Mobarak, R.; Shahidullah, M.; Hoque, M.; Ahmed, A. N. U.; Holmes, A. H.; Saha, S. K.; Saha, S.; Wan, Y.; Hooda, Y.

2026-06-05 infectious diseases 10.64898/2026.06.03.26354630 medRxiv
Top 0.2%
10.1%

Background: Healthcare-associated infections pose a major burden to neonatal health worldwide and remain difficult to track in low-resource hospitals because patient movement data and pathogen genomic data are rarely integrated into actionable transmission models. Existing approaches are often restricted to specific settings, highly structured electronic health records (EHRs), or analyses focused on either patient movements or pathogen characteristics alone. To address this gap, we developed PathoPath, an open-source integrative modelling platform, and evaluated its utility in a high-burden paediatric hospital in Dhaka, Bangladesh. Methods: PathoPath is an open-source R package that combines electronic health records with whole genome sequencing data to generate contact networks from direct and indirect contacts using minimal structured inputs. We retrospectively applied PathoPath to 373 cases of Klebsiella pneumoniae species complex (KpSC) infection identified in 2021 at the largest paediatric referral hospital in Dhaka, Bangladesh. Ward-level patient movement trajectories were used to reconstruct contact networks, and genomic data from isolates from children under 60 days old were integrated to identify probable dissemination of bacterial clones and antimicrobial resistance plasmids. Findings: PathoPath identified 750 direct contacts among 317 patients, forming 25 connected components, with the largest including 93 patients. KpSC infections were identified across 21 of 37 wards, with the neonatal intensive care unit accounting for 77.9% of all cases. Integration of genomic and network data distinguished sustained clustering of ST147 from multiple probable inter-clonal dissemination events involving IncFII plasmids carrying blaNDM-5 and/or blaOXA-181 within ST16. Four dominant sequence types accounted for 65.6% of sequenced isolates, and carbapenemase genes were detected in 95.8%. Interpretation: PathoPath reconstructs hospital-wide contact networks and integrates them with pathogen genomics to map probable dissemination of pathogens and antimicrobial resistance using minimal structured clinical data. It could support more targeted infection prevention and control in hospitals where granular digital records are not available.

10
Biophysical and enzymatic comparison of Bacillus safensis and Bacillus subtilis malate dehydrogenase (MDH) enzymes

Zafiropoulo, H. R.; Thomas, J. E.; Cortez, N. R.; Apostol, K.; de Sa, A.; Khosravi, R.; Moore, L.; Berndsen, C. E.; Bibel, B.

2026-05-14 biochemistry 10.64898/2026.05.13.723581 medRxiv
Top 0.2%
9.9%

Species of Bacillus bacteria, including Bacillus safensis and Bacillus subtilis, are finding increasing uses in biotechnology and bioremediation, thanks in part to their metabolic robustness. Malate dehydrogenase (MDH) is at the heart of central metabolism and thus a better understanding of Bacillus MDH proteins could aid in the optimization of these applications. MDH of Bacillus spp. belong to the lactate dehydrogenase (LDH)-like class of MDHs, otherwise known as the MDH3 class. Despite wide prevalence in nature among prokaryotes and archaea, this typically homotetrameric class is understudied compared to the MDH1 and MDH2 classes found in eukaryotes. We therefore recombinantly expressed and purified MDH proteins from two societally relevant Bacillus spp. (B. safensis and B. subtilis) and characterized them biophysically (via Size Exclusion Chromatography-Small Angle X-ray Scattering (SEC-SAXS) and Differential Scanning Fluorimetry (DSF)) and enzymatically (via spectroscopic activity assays). As expected based on their high sequence identity, the two MDH orthologs had similar properties in most regards, including a tetrameric structure and high susceptibility to substrate inhibition. However, we uncovered differences in conditional thermal stability, in addition to subtle differences in enzymatic activity that offer insight into the workings of LDH-like MDH. Summary statement: Malate dehydrogenase (MDH) is a fundamental metabolic enzyme, from microbes to mammals, yet comparatively little is known about microbial MDH, especially MDH of the tetrameric MDH3 class. We compare the biophysical and enzymatic properties of two such enzymes from the societally relevant bacterial species Bacillus subtilis and Bacillus safensis, offering useful insight with potential biotechnological implications.

11
Methodological Evaluation and Data Resource for Andes Virus Sequencing Preparedness

Doherty, R.; Lewandowski, K.; Fenwick, A.; Everall, I.; Morley, D.; Hartman, H.; Staplehurst, S.; Kent, C.; Loman, N. J.; Quick, J.; Pullan, S. T.

2026-05-16 genomics 10.64898/2026.05.15.725146 medRxiv
Top 0.2%
9.1%

As part of preparedness activities supporting pathogens classified under the UK High Consequence Infectious Diseases (HCID) framework, we previously evaluated both a whole-genome tiling amplicon sequencing scheme and a pan-viral hybridisation capture approach (TWIST-CVRP) for sequencing Andes virus (ANDV). In light of the recent outbreak, we make available viral sequencing datasets generated using a historical ANDV isolate (Chile, 1997). In addition, we provide an evaluation of tiling amplicon scheme performance and present recommended primer updates informed by in silico comparison with the recently released outbreak genome. These datasets are intended to support benchmarking, validation, and optimisation of bioinformatic pipelines across the community.

12
A Conditional Random Field approach for de novo reconstruction of bacterial haplotypes from a de Bruijn graph representation

Steyaert, A.; Van Hecke, M.; Marchal, K.; Fostier, J.

2026-05-12 bioinformatics 10.64898/2026.05.11.724222 medRxiv
Top 0.2%
8.5%

Background: Detecting distinct bacterial strains in a mixed sample is an important, yet less well-developed aspect of metagenomic research. Several methods exist that successfully retrieve a de novo reconstruction of viral strains. However, the reconstruction of bacterial haplotypes poses its own distinct challenges, and methods that successfully reconstruct full genome-length bacterial strains de novo are scarce. Here, we develop HaploDetox, a method for de novo bacterial haplotype reconstruction from short reads. We use a de Bruijn graph representation of the reads in which nodes correspond to k-mers from the read set and arcs represent overlaps between two node sequences. Our aim is to accurately assign labels to each node and arc in the graph to reveal the presence or absence of their corresponding sequence in individual strains. Results: Using a negative binomial mixture model, we model the relationship between the read coverage of nodes and arcs in the graph and their presence in a strain. We achieve improved labelling accuracy by including contextual information from neighbouring nodes and arcs with a Conditional Random Field. These labels are used to extract strain-specific de Bruijn graphs from the original graph. Additionally, we allow users to assess the number of strains present in the dataset based on model selection criteria. We evaluate our node/arc labelling accuracy on simulated datasets and in silico mixes of real datasets containing different numbers of strains, as well as on in vitro mixed real datasets. Existing de novo haplotype reconstruction methods present their reconstruction as strain-specific sets of SNPs. By aligning the unitigs from strain-specific graphs to a reference genome, we demonstrate that HaploDetox assigns strain-specific SNPs with higher recall than, and similar precision to, existing methods. Conclusions: We achieve improved strain-specific SNP phasing accuracy as compared to existing methods for de novo bacterial haplotype reconstruction. Additionally, HaploDetox is not limited to the determination of strain-specific SNPs, and other types of variant calls can be obtained through reference alignment. Finally, strain-specific de Bruijn graphs are an important first step towards full genome-length bacterial haplotype-aware assembly.
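The graph representation described in the abstract above (nodes as k-mers, arcs as overlaps between consecutive k-mers, each with a read coverage) can be sketched in a few lines. This is a toy illustration, not HaploDetox code: the function name `build_de_bruijn` and the example reads are hypothetical, reverse complements and sequencing errors are ignored, and no mixture model or Conditional Random Field is applied.

```python
from collections import Counter

def build_de_bruijn(reads, k):
    """Count k-mer nodes and arcs between k-mers adjacent in a read."""
    node_cov = Counter()  # k-mer -> number of occurrences across reads
    arc_cov = Counter()   # (k-mer, next k-mer) -> times seen adjacent
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        node_cov.update(kmers)
        arc_cov.update(zip(kmers, kmers[1:]))
    return node_cov, arc_cov

# Two reads agree on ...CGTA..., one read carries a variant ...CGTT...
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
nodes, arcs = build_de_bruijn(reads, k=3)
print(nodes["CGT"], arcs[("CGT", "GTA")], arcs[("CGT", "GTT")])
# prints: 3 2 1
```

In this toy graph the node CGT branches into two arcs with different coverage (2 and 1). Coverage differences of this kind are the signal that a coverage-based labelling model would use to decide which nodes and arcs belong to which strain.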

13
Insertion sequence elements associated with Staphylococcus epidermidis evolution in persistent orthopaedic device-related infections

Littlefair, J. C.; Kobras, C. M.; Post, V.; Pascoe, B.; Baker, D. J.; Erichsen, C.; Stracy, M.; Moriarty, F.; Sheppard, S. K.

2026-05-24 genomics 10.64898/2026.05.21.726754 medRxiv
Top 0.2%
8.5%

Background: Staphylococcus epidermidis is a major cause of orthopaedic device-related infections (ODRIs), which are often challenging to treat due to their extensive antimicrobial resistance (AMR) and biofilm formation. It has been hypothesised that S. epidermidis may rapidly adapt to the medical device niche, enhancing persistence, but direct evidence of within-host pathoadaptive evolution remains limited. Results: To investigate within-host evolution during chronic infection by S. epidermidis, we analysed isolates from patients with confirmed ODRIs and used a rat infection model to examine the evolution of strains from two distinct epidemic lineages (ST2 and ST23). Our analysis revealed that the replicative transposition of insertion sequence (IS) elements within the accessory genome was the predominant mechanism of genetic diversification. This was largely driven by the IS256 family, which accounted for approximately 25% of all mutational events. However, other than SCCmec deletions resulting in the loss of mecA, no mutations, including those which exhibited parallel evolution, were predicted or observed to influence AMR or biofilm formation. These findings suggest that the strains investigated in this study, which already exhibited high-level multidrug resistance and biofilm-forming ability, were likely pre-adapted epidemic S. epidermidis clones well suited to establishing persistent ODRIs. Conclusions: Our findings highlight the prominent role of IS elements in driving genetic diversification in S. epidermidis, underscoring the need for closer examination of their contribution to pathoadaptation during persistent infection.

14
A novel long-amplicon rpoB primer pair for high resolution microbiome analysis at the species-level

Venbrux, M.; Crauwels, S.; Rediers, H.

2026-05-17 molecular biology 10.64898/2026.05.15.725465 medRxiv
Top 0.2%
8.4%

The 16S rRNA gene is the most widely used genetic marker for microbial community profiling, but its limited sequence divergence often prevents species-level identification. The RNA polymerase β-subunit gene (rpoB) offers higher sequence variability, single-copy occurrence, and stronger phylogenetic consistency, yet its adoption in metataxonomic studies has been constrained by the lack of universal primer sets. Here, we present a novel universal primer pair that amplifies an ~1,800 bp rpoB region (rpoB_MV) compatible with long-read sequencing platforms. In silico evaluation across 17,683 bacterial reference genomes demonstrated high universality, with over 86% of genomes predicted to amplify. Compared with full-length and partial 16S rRNA gene markers, the rpoB_MV amplicon exhibited significantly greater inter-species sequence divergence and improved phylogenetic concordance with core-genome trees. Sequencing of two complementary mock communities confirmed superior species-level identification accuracy, with misclassification rates below 0.01% and no reads assigned to unresolved species clusters. These results establish rpoB_MV as a robust alternative to 16S rRNA gene-based profiling for high-resolution metataxonomic applications. Importance: Microbial community studies increasingly require species-level resolution because species within the same genus can differ substantially in pathogenicity, ecological function, and metabolic capacity. Current 16S rRNA gene-based methods frequently fail to distinguish closely related species, collapsing biologically distinct organisms into the same taxonomic assignment and obscuring community differences that matter for clinical diagnostics, food safety, and environmental monitoring. The rpoB_MV primer pair presented here overcomes this limitation by targeting a longer, more variable region of the rpoB gene, enabling accurate species-level identification across diverse bacterial phyla. Combined with advances in long-read sequencing, this approach provides researchers with a practical tool to resolve microbial communities at the species level.

15
Paired viromics resolves the modular ecological architecture of the swine nasopharyngeal phageome

Mencia-Ares, O.; Deneke, C.; Martinez-Martinez, S.; Malorny, B.; Gutierrez-Martin, C. B.; Gruetzke, J.

2026-05-29 microbiology 10.64898/2026.05.28.728360 medRxiv
Top 0.2%
8.4%

Background: Bacteriophages are recognized modulators of microbiome composition and function, yet their role in the porcine upper respiratory tract, a primary gateway for pathogen colonization in the post-weaning period, remains unexplored. Unlike the porcine gut, no reference framework is available for respiratory sites. Furthermore, the low-biomass nature of nasopharyngeal specimens makes virome recovery highly sensitive to extraction strategy, but the extent to which workflow choice shapes ecological inference in this niche has not been evaluated. Results: We profiled the nasopharyngeal phageome of post-weaning piglets across ten commercial farms (30 pen-level pools) using paired DNA-microbiome (DNA-m) and virus-like particle-enriched (VLP-e) short-read metagenomics (n = 60 libraries). Protocol choice strongly reshaped viral recovery (PERMANOVA R² = 0.448, p < 0.0001), with a contig overlap between workflows of <1%. DNA-m favored assembly contiguity, while VLP-e maximized viral detection. By integrating both approaches, we constructed a curated catalogue of 2,501 non-redundant viral operational taxonomic units (vOTUs), with only 5.2% showing similarity to known phages, underscoring the extensive novelty of this niche. Ecologically, within the integrated community dataset (n = 4,357), predicted replication strategy emerged as a dominant organizing axis: lifestyle explained up to 40.6% of compositional variation at family level. Host prediction linked phages to dominant upper-airway colonizers, including Streptococcaceae, Moraxellaceae, and Pasteurellaceae, with a marked lifestyle-host polarization: virulent phages were preferentially linked to Bacteroidota (particularly Prevotella), whereas temperate phages were enriched in Streptococcaceae and Moraxellaceae. Integration of viral taxonomy and host affiliation resolved a modular architecture in which a few recurrent phage-host couplings (e.g., Suoliviridae-Bacteroidota, Peduoviridae-Pasteurellaceae, Aliceevansviridae-Streptococcaceae) were conserved but differentially weighted between virulent and temperate fractions. Conclusions: This study establishes the first phageome catalogue and ecological framework for a respiratory site in livestock. The nasopharyngeal phageome is organized into recurrent, host-linked taxonomic modules jointly constrained by viral lineage, host affiliation and replication strategy, with lifestyle-dependent connections to key colonizers implicated in the porcine respiratory disease complex. This catalogue and its modular architecture provide a foundation for investigating phage-mediated modulation of bacterial dynamics during the post-weaning transition and for the selection of lytic phage candidates targeting respiratory pathogens.

16
A protocol for the TRACS-Liverpool study, tracking transmission of extended-spectrum beta-lactamase producing Enterobacterales across health and social care settings in the United Kingdom

Gallichan, S.; Lewis, J. M.; Forrest, S.; Moore, M.; Picton-Barlow, E.; McKeown, C.; Jewell, C. P.; Todd, S.; Graf, F. E.; Feasey, N. A.

2026-05-15 infectious diseases 10.64898/2026.05.13.26352872 medRxiv
Top 0.3%
8.3%

Background: Antimicrobial resistance (AMR) is a global public health problem. Infections caused by extended-spectrum beta-lactamase (ESBL)- and carbapenemase (CP)-producing Enterobacterales (E) threaten individuals and healthcare systems worldwide. Symptomatic infection caused by Enterobacterales is typically preceded by asymptomatic colonisation and often occurs in the most vulnerable individuals; interrupting asymptomatic transmission is therefore desirable. The dominant transmission routes across the healthcare continuum, including hospitals, intermediate care, and long-term care facilities, are not well understood. Methods: Here we present a protocol describing a genomic surveillance framework developed for the Tracking Antimicrobial Resistance Across Care Settings (TRACS) Liverpool programme, which aims to identify critical ESBL-E transmission points in hospitals and care homes in Liverpool, UK. Our study integrates individual participant and healthcare facility data, validated standard operating procedures for taking and culturing stool, rectal, environmental, and staff samples, genomic sequencing of ESBL-E, and statistical modelling approaches into a research framework for ESBL-E genomic surveillance. Discussion: There is a need for improved epidemiological and laboratory approaches to studying bacterial transmission. Drug-resistant enteric bacteria are a highly tractable marker of the movement of all enteric bacteria, and interventions designed to interrupt transmission of drug-resistant bacteria are expected to have a broader healthcare impact. This protocol provides a standardised, reproducible approach for identifying ESBL-E, tracking acquisition events, and linking clinical and environmental isolates through whole-genome sequencing.

17
Carbohydrate active enzymes in Pectobacteriaceae: coevolving enzyme sets and host adaptation

Hobbs, E. E. M.; Gloster, T. M.; Pritchard, L.

2026-05-12 bioinformatics 10.64898/2026.05.08.723719 medRxiv
Top 0.3%
7.0%

Many phytopathogenic bacteria have evolved large, diverse arsenals of Carbohydrate Active enZymes (CAZymes) that liberate simple sugars, and thus nutrition and energy, from the complex lignocellulosic matrices of their plant hosts. The CAZyme arsenals of these phytopathogens are expected to be influenced by and adapted to the cell wall composition of their plant hosts. The solutions these organisms have reached for the problem of degrading plant material may help us understand their host ranges and present a rich source of novel CAZymes for exploitation in industrial bioprocessing. Here we catalogue and analyse CAZyme complements (CAZomes) of publicly available Enterobacterial phytopathogen genomes, including those of the economically significant and widely studied Pectobacterium and Dickeya genera. These comprise a broad diversity of CAZymes, providing insight into host adaptation and a resource for bio-prospection of industrially relevant enzymes. We find evidence supporting coevolution of sets of CAZymes specific to bacterial genus and species and, notably, CAZymes associated with pathogen preference for either woody or soft plant tissue, suggesting adaptation of CAZomes to host plant cell wall composition.

18
NAP: an open-source pipeline for cross-domain microbiome profiling using Nanopore sequencing-derived amplicon data

Jones, L. B.; Bagby, S.

2026-05-26 bioinformatics 10.64898/2026.05.22.727110 medRxiv
Top 0.3%
6.9%

Background: Nanopore sequencing offers a cost-effective and portable platform for microbiome analysis, but amplicon-based approaches remain limited by higher sequencing error rates and a lack of workflows tailored to mixed domain ribosomal RNA profiling. While short-read technologies dominate microbial community analysis, their portability and flexibility are constrained. There is therefore a need for robust pipelines designed specifically for cross-domain Nanopore amplicon data. Results: We introduce the Nanopore sequencing-based Amplicon Pipeline (NAP; https://github.com/Luke-B-Jones/NAP), an open-source workflow optimised for flexible mixed domain primer sets such as 515Y/926R. NAP performs adaptive quality filtering, chimera removal, centroid generation, BLAST-based taxonomic classification, hierarchical consensus correction, and domain-aware post-processing, outputting decontaminated abundance tables suitable for downstream analysis. Initial validation against two complementary commercial mock communities showed that NAP achieved strong genus-level performance across both low complexity logarithmic and more compositionally complex gut mock communities. Detection was most reliable above ca. 1% relative abundance, and replicate outputs showed strong agreement with expected composition under Bray-Curtis, Jaccard, agreement-plot, and Bland-Altman analyses. Benchmarking of NAP's internal filtering modes showed that the default adaptive setting provided the most robust balance of read quality, retained depth, and downstream taxonomic fidelity across heterogeneous inputs. Direct comparison against QIIME2 and Kraken2/Bracken further showed that NAP most accurately preserved expected community structure, with markedly fewer false positive assignments at genus level and substantially stronger species-level behaviour under the tested conditions. Species-level assignments were informative for some taxa, but remained less robust than genus-level outputs with the default V4-V5 amplicon. Conclusions: NAP provides a robust and flexible workflow for cross-domain Nanopore amplicon profiling, with strongest performance at genus level and competitive species-level behaviour for well resolved taxa. Although analysis of field-derived data was not assessed here, NAP compatibility with portable Nanopore sequencing supports accurate mixed domain microbiome profiling under the tested conditions.

19
gTranslate: rapid and accurate translation table prediction for prokaryotic genomes

Chaumeil, P.-A.; Hugenholtz, P.; Parks, D. H.

2026-05-28 bioinformatics 10.64898/2026.05.24.727570 medRxiv
Top 0.3%
6.4%

Background: Bioinformatic tools often require the prediction of protein-coding genes to make inferences about prokaryotic genomes. Typically, the genetic code used for translating genes to proteins must be specified by the user based on the taxonomic classification of a genome assembly or, for some widely used tools, established using a heuristic rule based on gene coding densities. Manual specification is at best inconvenient, but more challenging is that many bioinformatic tools are applied before taxonomic classifications have been established, making it impractical to specify the translation table. Methods: Here we provide a computationally efficient tool, gTranslate, that uses an ensemble of five machine learning methods to accurately predict translation tables for prokaryotic genomes. The feature vector used by gTranslate takes advantage of differences in gene coding densities when predicting genes under different translation tables along with features that consider the number and ratio of UGA stop codon reassignments to tryptophan or glycine. Results: We demonstrate that gTranslate correctly predicts the translation table of prokaryotic genomes >99.99% of the time (i.e. <1 error per 10,000 genomes) and outperforms a more computationally expensive prediction method and a coding density heuristic used by popular bioinformatic tools. Using gTranslate, we identify a basal lineage of Ca. Stammera capleta that uses the standard bacterial genetic code instead of the UGA stop codon to tryptophan reassignment common to other members of this species. We also identify the first instances of UGA-to-tryptophan reassignment in the Patescibacteriota, making this the first bacterial phylum with members capable of using translation tables 4, 11, and 25.

20
Development and validation of a multilocus sequence typing scheme for Fasciola hepatica using next-generation deep amplicon sequencing

Abbas, M.; Kozel, K.; Daramola, O.; Selemetas, N.; Robinson, M. W.; Morgan, E. R.; Chaudhry, U.; Betson, M.

2026-05-22 genetics 10.64898/2026.05.20.726500 medRxiv
Top 0.3%
6.3%

Fasciolosis caused by Fasciola hepatica is an economically important disease in sheep and cattle. Knowledge of the population genetic structure of F. hepatica is important for understanding gene flow and informing disease control. In the present study, we designed, developed, and validated a multilocus sequence typing (MLST) scheme based on six markers. These markers were selected by aligning newly sequenced whole-genome sequence (WGS) data with available reference genomes and selecting variable regions with five or more single-nucleotide polymorphisms (SNPs) from different scaffolds of the F. hepatica reference genome Fasciola 10x pilon (GCA_900302435.1). Twenty markers were initially identified, of which 12 were multiplexed for deep amplicon sequencing after validation on worm and faecal egg DNA; six markers were ultimately retained for downstream population genetics analysis. These markers were used to investigate population genetic structure in 15 cattle- and 27 sheep-derived F. hepatica populations in the UK. A total of 53 unique alleles from six MLST markers were identified from 30 faecal (cattle = 13, sheep = 17) and 12 adult worm (cattle = 2, sheep = 10) populations. Shared alleles were observed in sheep- and cattle-derived populations. The highest allelic variation was observed in the Scottish Borders, Southern Scotland, and South-West England, and the lowest in North-West England. Minimal genetic differentiation was observed between cattle- and sheep-derived populations, with most genetic structuring within rather than between populations. Five markers showed high allelic polymorphism, whereas one marker showed low levels of allelic polymorphism, highlighting the importance of multilocus approaches. Overall, this six-marker MLST panel provides a tool for population genetic studies, revealing high gene flow and clonal expansion of F. hepatica across hosts and regions in the UK.